AIML Online Capstone - Pneumonia Detection Challenge

The goal is to build a pneumonia detection system that locates the position of inflammation in a chest X-ray image.

Tissues with sparse material, such as lungs which are full of air, do not absorb the X-rays and appear black in the image. Dense tissues such as bones absorb X-rays and appear white in the image. While we are theoretically detecting “lung opacities”, there are lung opacities that are not pneumonia related.

In the data, some of these are labeled “Not Normal No Lung Opacity”. This extra third class indicates that while pneumonia was determined not to be present, there was nonetheless some type of abnormality on the image and oftentimes this finding may mimic the appearance of true pneumonia.

DICOM original images: Medical images are stored in a special format called DICOM (*.dcm files). Each file contains header metadata together with the underlying raw pixel array of the image.

PART 1 - Exploratory Data Analysis

There are no nulls in Target and patientId columns.

Columns x, y, width and height are null for 20672 rows, the same number of rows that are pneumonia negative.

Target column distribution: 20672 rows are pneumonia negative and 9555 are pneumonia positive, roughly 68% to 32%. This is a reasonable proportion.

A total of 132 patients have more than one anomaly.

In total, 2454 rows in the CSV represent patches from images with multiple anomalies.
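The counts above can be reproduced with a few lines of Python. The following is a minimal sketch using only the standard library. The column names (patientId, x, y, width, height, Target) come from the text above, while the inline sample rows and the `summarize` helper are hypothetical and only stand in for the real labels CSV.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the labels CSV (not real data).
SAMPLE_CSV = """patientId,x,y,width,height,Target
p1,,,,,0
p2,264,152,213,379,1
p2,562,152,256,453,1
p3,,,,,0
p4,323,577,160,104,1
"""


def summarize(csv_text):
    """Return null-box count, Target distribution and multi-box patients."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    box_cols = ("x", "y", "width", "height")

    # Rows where every bounding-box column is empty.
    null_box_rows = sum(all(r[c] == "" for c in box_cols) for r in rows)

    # Class balance of the Target column.
    target_counts = Counter(int(r["Target"]) for r in rows)

    # Positive rows per patient; more than one means multiple anomalies.
    boxes_per_patient = Counter(r["patientId"] for r in rows if r["Target"] == "1")
    multi = {p: n for p, n in boxes_per_patient.items() if n > 1}

    return {
        "null_box_rows": null_box_rows,
        "target_counts": dict(target_counts),
        "patients_with_multiple": len(multi),
        "rows_for_those_patients": sum(multi.values()),
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

On the real file, the same function would be called with the full CSV text read from disk instead of `SAMPLE_CSV`.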

Let us write some code to load the binary image data into a dictionary.

The key of the dictionary is the patientId and the value is an array of size 32x32 that represents the X-ray image.

This data will be helpful in further model building.
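A minimal sketch of such a dictionary, assuming each X-ray is already available as a square 2-D grid of pixel values. The `downsample` and `build_image_dict` helpers and the `read_pixels` callable are hypothetical names. In practice the reader would wrap whatever DICOM library is used to obtain the pixel array. Block averaging is just one possible way to shrink an image and is not necessarily the resizing method used in this project.

```python
def downsample(pixels, size=32):
    """Shrink a square 2-D grid to size x size by averaging equal blocks.

    Assumes the side length of `pixels` is a multiple of `size`
    (true for 1024 -> 32).
    """
    side = len(pixels)
    if side % size != 0 or any(len(row) != side for row in pixels):
        raise ValueError("expected a square grid whose side is a multiple of size")
    block = side // size
    out = []
    for by in range(size):
        out_row = []
        for bx in range(size):
            total = 0
            for y in range(by * block, (by + 1) * block):
                total += sum(pixels[y][bx * block:(bx + 1) * block])
            out_row.append(total / (block * block))
        out.append(out_row)
    return out


def build_image_dict(patient_ids, read_pixels, size=32):
    """Map each patientId to its downsampled image.

    `read_pixels` is a hypothetical callable: patientId -> 2-D pixel grid.
    """
    return {pid: downsample(read_pixels(pid), size) for pid in patient_ids}


# Usage with a synthetic 64x64 gradient instead of a real DICOM file.
def fake_reader(pid):
    return [[x + y for x in range(64)] for y in range(64)]


images = build_image_dict(["p1", "p2"], fake_reader)
print(len(images["p1"]), len(images["p1"][0]))
```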

PART 2 - Model Building

Reading the Data Set and EDA

Reading the class Info Data Set

Merging the class and labels data set into training dataset
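One way to sketch this merge, assuming both files share a patientId column and the class-info file carries the class label in a column named `class`. That column name, the sample rows and the `merge_on_patient` helper are assumptions for illustration. The version below uses only the standard library and performs a left join, so every label row is kept.

```python
def merge_on_patient(label_rows, class_rows):
    """Left-join class information onto label rows by patientId."""
    # One class per patient; later duplicates for the same patient are ignored.
    class_by_patient = {}
    for row in class_rows:
        class_by_patient.setdefault(row["patientId"], row["class"])

    merged = []
    for row in label_rows:
        combined = dict(row)  # copy so the input rows stay untouched
        combined["class"] = class_by_patient.get(row["patientId"])
        merged.append(combined)
    return merged


# Hypothetical sample rows (not real data).
labels = [
    {"patientId": "p1", "Target": 0},
    {"patientId": "p2", "Target": 1},
    {"patientId": "p2", "Target": 1},
]
classes = [
    {"patientId": "p1", "class": "Normal"},
    {"patientId": "p2", "class": "Lung Opacity"},
    {"patientId": "p2", "class": "Lung Opacity"},
]

training = merge_on_patient(labels, classes)
print(training)
```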

Displaying Chest X-ray Images of Patients who have Pneumonia

Reading the DICOM image metadata and appending it to the training set

MODEL BUILDING

CNN with Transfer Learning using VGG16

CNN with ResNet50

Bounding Box Prediction : UNet

Creating a custom train generator. This will read the files in batches of 10 while training the model.
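The batching idea can be sketched as a plain Python generator. This is an illustration under stated assumptions, not the project's generator: `batch_generator`, `load_sample` and the file names are hypothetical, and a generator feeding a deep-learning framework would additionally stack each batch into arrays.

```python
import random


def batch_generator(filenames, load_sample, batch_size=10, shuffle=True, seed=None):
    """Yield (images, targets) lists, reading at most batch_size files at a time.

    `load_sample` is a hypothetical callable: filename -> (image, target).
    """
    order = list(filenames)
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = [load_sample(name) for name in order[start:start + batch_size]]
        images = [image for image, _ in batch]
        targets = [target for _, target in batch]
        yield images, targets


# Usage with 25 made-up file names and a dummy loader.
files = [f"img_{i}.dcm" for i in range(25)]
batches = list(batch_generator(files, lambda name: (name, 0), shuffle=False))
print([len(images) for images, _ in batches])
```

Because files are loaded only when a batch is requested, memory use stays bounded by the batch size rather than by the size of the data set.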

UNET USING MobileNet

The val_dice-coefficient value is very low and its curve is nearly flat, which indicates underfitting: the model has not learnt sufficiently. The value does increase steadily, though only slightly, so training was insufficient and more epochs are needed.
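For reference, the Dice coefficient compares a predicted mask A with a ground-truth mask B as 2|A ∩ B| / (|A| + |B|), so 1.0 is a perfect overlap and values near 0 mean almost no overlap. A minimal sketch on binary masks follows. The `dice_coefficient` name and the `smooth` term are assumptions, since the project's exact metric definition is not shown here.

```python
def dice_coefficient(y_true, y_pred, smooth=1e-6):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    true_flat = [v for row in y_true for v in row]
    pred_flat = [v for row in y_pred for v in row]
    if len(true_flat) != len(pred_flat):
        raise ValueError("masks must have the same shape")
    intersection = sum(t * p for t, p in zip(true_flat, pred_flat))
    # `smooth` avoids division by zero when both masks are empty.
    return (2.0 * intersection + smooth) / (sum(true_flat) + sum(pred_flat) + smooth)


# Two small made-up masks that overlap on half of their pixels.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
pred = [[0, 0, 1, 1],
        [0, 0, 1, 1]]

print(round(dice_coefficient(truth, truth), 3))
print(round(dice_coefficient(truth, pred), 3))
```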

We used an image size of 224x224 instead of the original 1024x1024. Using a higher resolution could also improve the model's capacity to learn.

  • Hyper-parameter tuning, image augmentation and different architectures should help increase model performance and generalization.

  • Recall score is quite high for the pneumonia class (target = 1), even with a very low dice-coefficient score for image mask prediction. This is a good score, because it indicates that 99% of patients who are positive are detected correctly by this model.
  • Precision score, however, is lower, indicating that only 83% of the positive predictions are correct. This is largely due to a high number of false positives, as indicated in the confusion matrix.
  • The accuracy score is 85%. This can probably improve as the model trains on more samples, which will help it distinguish non-pneumonia images better.
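
The three scores in the bullets above all derive from the confusion matrix. The sketch below shows the arithmetic. The counts are made up for illustration and are not the project's results, and `classification_scores` is a hypothetical helper.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute recall, precision and accuracy from confusion-matrix counts."""
    # Recall: share of actual positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of positive predictions that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Accuracy: share of all predictions that are correct.
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"recall": recall, "precision": precision, "accuracy": accuracy}


# Illustrative counts only: few missed positives, many false positives.
scores = classification_scores(tp=90, fp=30, fn=10, tn=70)
print(scores)
```

A high recall with a lower precision, as in this made-up example, is the pattern produced when false positives outnumber false negatives.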